从3D点云中对可遍历区域和感兴趣的对象的感知是自主导航中的关键任务之一。一辆地面车辆需要寻找可以通过车轮探索的可遍历的地形。然后,为了做出安全的导航决定,必须跟踪位于这些地形上的物体的分割。但是,过度分割和分割不足可能会对此类导航决策产生负面影响。为此,我们提出了旅行,该行程使用3D点云的图表表示可遍历的地面检测和对象聚类。为了将可穿越的接地段分割,将点云编码为图形结构,即三个格里德字段,该场将每个三个格里德视为节点。然后,通过检查连接节点的边缘的局部凸度和凹度来搜索和重新定义可遍历的区域。另一方面,我们的地上对象分割通过表示球形预测空间中的一组水平相邻的3D点作为节点和节点之间的垂直/水平关系,以使用图形结构。充分利用节点边缘结构,上面的分割可确保实时操作并减轻过度分割。通过使用模拟,城市场景和我们自己的数据集的实验,我们已经证明,根据常规指标,我们提出的遍历地面分割算法优于其他最新方法,并且我们新提出的评估指标对于评估是有意义的地上细分。我们将在https://github.com/url-kaist/travel上向公开提供代码和自己的数据集。
translated by 谷歌翻译
弱监督的对象检测(WSOD)是一项任务,可使用仅在图像级注释上训练的模型来检测图像中的对象。当前的最新模型受益于自我监督的实例级别的监督,但是由于弱监督不包括计数或位置信息,因此最常见的``Argmax''标签方法通常忽略了许多对象实例。为了减轻此问题,我们提出了一种新颖的多个实例标记方法,称为对象发现。我们进一步在弱监督下引入了新的对比损失,在该监督下,没有实例级信息可用于采样,称为弱监督对比损失(WSCL)。WSCL旨在通过利用一致的功能来嵌入同一类中的向量来构建对象发现的可靠相似性阈值。结果,我们在2014年和2017年MS-Coco以及Pascal VOC 2012上取得了新的最新结果,并在Pascal VOC 2007上取得了竞争成果。
translated by 谷歌翻译
我们提出了一种新方法,用于近似于基于假设标记的候选数据点进行重新培训的主动学习获取策略。尽管这通常与深层网络不可行,但我们使用神经切线内核来近似重新进行重新培训的结果,并证明该近似值即使在主动学习设置中也无效 - 近似于“ look-aead abead”选择标准,所需的计算要少得多。 。这也使我们能够进行顺序的主动学习,即在流态中更新模型,而无需在添加每个新数据点后使用SGD重新训练模型。此外,我们的查询策略可以更好地理解模型的预测将如何通过与标准(“近视”)标准相比,通过大幅度击败其他查看策略,并获得相等或更好的性能,并取得了相等或更好的性能。基于池的主动学习中的几个基准数据集上的最新方法。
translated by 谷歌翻译
最近,基于深度神经网络(DNN)的药物 - 目标相互作用(DTI)模型以高精度突出显示,具有实惠的计算成本。然而,模型在硅药物发现的实践中仍然是一个具有挑战性的问题。我们提出了两项​​关键策略,以提高DTI模型的概括。首先是通过用神经网络参数化的物理通知方程来预测原子原子对相互作用,并提供蛋白质 - 配体复合物作为其总和的总结合亲和力。通过增强更广泛的绑定姿势和配体来培训数据,我们进一步改善了模型泛化。我们验证了我们的模型,PIGNET,在评分职能(CASF)2016的比较评估中,展示了比以前的方法更优于对接和筛选力。我们的物理信息策略还通过可视化配体副结构的贡献来解释预测的亲和力,为进一步配体优化提供了见解。
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
Interview has been regarded as one of the most crucial step for recruitment. To fully prepare for the interview with the recruiters, job seekers usually practice with mock interviews between each other. However, such a mock interview with peers is generally far away from the real interview experience: the mock interviewers are not guaranteed to be professional and are not likely to behave like a real interviewer. Due to the rapid growth of online recruitment in recent years, recruiters tend to have online interviews, which makes it possible to collect real interview data from real interviewers. In this paper, we propose a novel application named EZInterviewer, which aims to learn from the online interview data and provides mock interview services to the job seekers. The task is challenging in two ways: (1) the interview data are now available but still of low-resource; (2) to generate meaningful and relevant interview dialogs requires thorough understanding of both resumes and job descriptions. To address the low-resource challenge, EZInterviewer is trained on a very small set of interview dialogs. The key idea is to reduce the number of parameters that rely on interview dialogs by disentangling the knowledge selector and dialog generator so that most parameters can be trained with ungrounded dialogs as well as the resume data that are not low-resource. Evaluation results on a real-world job interview dialog dataset indicate that we achieve promising results to generate mock interviews. With the help of EZInterviewer, we hope to make mock interview practice become easier for job seekers.
translated by 谷歌翻译
Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.
translated by 谷歌翻译
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
translated by 谷歌翻译
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
translated by 谷歌翻译
A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.
translated by 谷歌翻译